This paper investigates, on information-theoretic grounds, a learning problem based on the principle that any regularity in a given dataset can be exploited to extract compact features from data, i.e., using fewer bits than needed to fully describe the data itself, in order to build meaningful representations of relevant content (multiple labels). We begin by introducing the noisy lossy source coding paradigm with the log-loss fidelity criterion, which provides the fundamental tradeoffs between the \emph{cross-entropy loss} (average risk) and the information rate of the features (model complexity). Our approach allows an information-theoretic formulation of the \emph{multi-task learning} (MTL) problem, a supervised learning framework in which the prediction models for several related tasks are learned jointly from common representations to achieve better generalization performance. We then present an iterative algorithm for computing the optimal tradeoffs and prove its global convergence under suitable conditions. An important property of this algorithm is that it provides a natural safeguard against overfitting, because it minimizes the average risk while accounting for a penalization induced by the model complexity. Remarkably, empirical results illustrate that there exists an optimal information rate minimizing the \emph{excess risk}, which depends on the nature and the amount of available training data. An application to hierarchical text categorization is also investigated, extending previous work.
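One concrete reading of the rate/risk tradeoff described above (an assumption on our part; the paper's exact formulation and multi-task extension may differ) is the information-bottleneck-style Lagrangian $\min_{q(u|x)} H(Y|U) + \tfrac{1}{\beta} I(X;U)$, which can be computed by Blahut-Arimoto-style alternating updates. The sketch below is a minimal single-task illustration assuming finite alphabets and a known joint distribution $p(x,y)$; the function name `ib_blahut_arimoto` and all parameters are illustrative, not the paper's algorithm.

```python
import numpy as np

def ib_blahut_arimoto(p_xy, n_u, beta, n_iter=200, seed=0):
    """Alternating-minimization sketch for the rate/log-loss tradeoff:
    trades the rate I(X;U) against the average log-loss H(Y|U) at
    multiplier beta. p_xy: joint distribution, shape (n_x, n_y).
    Returns the encoder q(u|x), the rate, and the average risk."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                      # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]           # conditional p(y|x)

    # Random initial encoder q(u|x); each row is a distribution over u.
    q_u_given_x = rng.random((n_x, n_u))
    q_u_given_x /= q_u_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        q_u = p_x @ q_u_given_x                 # marginal q(u)
        # Decoder q(y|u) induced by the encoder via Bayes' rule.
        q_y_given_u = (q_u_given_x * p_x[:, None]).T @ p_y_given_x
        q_y_given_u /= q_u[:, None]
        # KL(p(y|x) || q(y|u)) for every (x, u) pair.
        log_ratio = (np.log(p_y_given_x[:, None, :] + 1e-12)
                     - np.log(q_y_given_u[None, :, :] + 1e-12))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # Encoder update: q(u|x) proportional to q(u) * exp(-beta * KL).
        q_u_given_x = q_u[None, :] * np.exp(-beta * kl)
        q_u_given_x /= q_u_given_x.sum(axis=1, keepdims=True)

    q_u = p_x @ q_u_given_x
    q_y_given_u = (q_u_given_x * p_x[:, None]).T @ p_y_given_x / q_u[:, None]
    # Operating point: rate I(X;U) and average log-loss H(Y|U).
    rate = (p_x[:, None] * q_u_given_x
            * np.log((q_u_given_x + 1e-12) / (q_u[None, :] + 1e-12))).sum()
    q_uy = q_u[:, None] * q_y_given_u
    risk = -(q_uy * np.log(q_y_given_u + 1e-12)).sum()
    return q_u_given_x, rate, risk
```

Sweeping beta traces out the rate-risk curve; consistent with the abstract's observation, the information rate minimizing the excess risk would then be selected, e.g., on held-out data.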